Vowels are the primary units of the sound system of a language, so their classification is very important for speech recognition and synthesis. In this paper, we propose an approach based on the normalized energy in the formant and pitch bands to characterize Arabic vowels (short vowels: /a/, /i/, /u/; long vowels: /a:/, /i:/, /u:/). Classification was performed with a purpose-built algorithm on recordings extracted from an Arabic corpus, after extracting the pitch and the first three formants and computing the normalized energy in the corresponding bands. The results show that the algorithm distinguishes Arabic vowels by analyzing the normalized energy in the nuclei of the F1, F2, and F3 formants and of the pitch F0, with a recognition rate of 88.7% for long vowels and 90% for short vowels.
1. INTRODUCTION

The production of vowels depends on the airflow through the cavities above the glottis (the oral and nasal cavities). Vowels are usually voiced sounds, produced by the vibration of the vocal cords, and their spectrum has a higher amplitude in the low and medium frequencies than that of consonants [1]-[3]. The spectral maxima that appear in these frequency regions during vowel production are called formants. Formant frequencies have been shown to be very important for determining and identifying a vowel: the listening experiments of Peterson and Barney made it possible to map vowels by their first two formants (F1 and F2), and higher formants also play a role in vowel identity [4]-[9].

The first formant is associated with changes in mouth opening: sounds requiring a small mouth opening have a low F1, whereas sounds requiring a wide mouth opening have a high F1. The second formant is associated with changes in the oral cavity, such as the position of the tongue and the activity of the lips, while the third formant is associated with a front-to-back constriction in the oral cavity [10].

Based on these observations, many studies have analyzed formants to identify vowels. Lulich et al. [11] analyzed the second formant when the vocal tract is coupled with the lower (subglottal) respiratory tract. They showed that the amplitude discontinuity caused by the second formant (F2) crossing the second subglottal resonance plays an important role in vowel perception. This is consistent with [12], which demonstrated the role of the second subglottal resonance in distinguishing vowels. Natour et al. [13] studied the acoustic characteristics of the normal Arabic voice by analyzing the formants F1, F2, and F3. They showed that the formant frequencies of Jordanian Arabic speakers are significantly lower in men, women, and children than those reported for other racial backgrounds and dialects. Other studies have approached formant analysis from a different perspective. Alotaibi and Hussain [14] analyzed Arabic vowels by studying the values of the first and second formants in consonant-vowel-consonant (CVC) utterances, and classified the vowels using a hidden Markov model (HMM) recognizer.

Vowels have also been examined using duration cues. Alotaibi and Hussain [15] showed that duration is crucial for differentiating between short and long vowels. Other researchers have demonstrated a correlation between speech rate and vowel duration: as speech tempo increases, vowel duration decreases [16]. According to Mok [17], vowel duration has recently emerged as a crucial acoustic cue for vowel identification and speech understanding. In addition, Khattab and Al-Tamimi [18] observed no significant differences in vowel duration between male and female speakers.

Researchers have also used spectral moments to characterize vowels, since the vowel spectrum reflects both the characteristics of the glottal source and the filtering of the vocal tract, and the spectral moments combine the spectral amplitude pattern of the glottal source with the resonances of the vocal tract [19]. Tahiry et al. [20] showed that the center of gravity (CoG) and the standard deviation (STD) reveal two phases of vowel production: a transient regime at the start of production, followed by a stationary state as the duration increases. Savela et al. [21] found that vowel identification can be based on spectral moments; indeed, several spectral attributes in the vowel space allow vowel quality to be evaluated and vowels to be classified. Pentti [22] analyzed spectral moments to compare the production of /S/ by young children and adults, and found significant effects of vowel coarticulation on the spectral characteristics, especially in symmetrical vowel contexts.
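To make the spectral moments above concrete, the sketch below computes the center of gravity as the first moment of a spectrum and the standard deviation as the square root of its second central moment. It is a minimal sketch, not the procedure used in [20] or [21]: weighting each frequency by the power spectrum |X(k)|^2 is an assumption, and the function name `spectral_moments` is hypothetical.

```python
import numpy as np

def spectral_moments(freqs, power):
    """Return (CoG, STD) in Hz for a spectrum.

    freqs: bin frequencies in Hz; power: non-negative weight per bin
    (assumed here to be the power spectrum |X(k)|^2).
    """
    freqs = np.asarray(freqs, dtype=float)
    weights = np.asarray(power, dtype=float)
    weights = weights / weights.sum()                            # normalize to a distribution
    cog = float(np.sum(freqs * weights))                         # first moment: center of gravity
    std = float(np.sqrt(np.sum((freqs - cog) ** 2 * weights)))   # spread around the CoG
    return cog, std

# Two equal-power components at 1,000 Hz and 3,000 Hz
print(spectral_moments([1000, 3000], [1.0, 1.0]))  # (2000.0, 1000.0)
```

A spectrum whose energy shifts toward higher frequencies raises the CoG, and a flatter spectrum raises the STD, which is why these two values can track the transient and stationary phases of a vowel.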

Another line of study demonstrated that the energy distribution in frequency bands can characterize Arabic vowels according to production duration, while the percentage distribution of energy across these bands appears to be unaffected by production time [23]. Given the positions of the formants F1 and F2, the authors showed that most of the energy of the vowel /a/ lies in the first five bands (B1, B2, B3, B4, and B5): the distance between these formants (F1 > 600 Hz and F2 > 1,000 Hz) leads to an energy distribution that covers all five bands. The energy of the vowel /i/ is concentrated in the first and fifth bands (B1 and B5), because its first formant F1 (around 300 Hz) lies in the low frequencies, where the energy produced by glottal vibration (about 400 Hz) is also found. For the vowel /u/, both F1 and F2 are located in the low frequencies, so its energy is concentrated mainly in the first band B1. Another feature extraction approach, based on a frequency response model of the vocal tract, has also been proposed: Paulraj et al. [24] used the mean and maximum energy amplitudes in fixed frequency bands between 20 and 2,500 Hz as features, which made it possible to detect the vowels /a/, /e/, /i/, /o/, and /u/ using multinomial logistic regression (MLR).

Inspired by previous work on Arabic vowel recognition, this study explores a further aspect of Arabic vowels. The findings presented in this article are based on an acoustic study whose primary goal is to identify short and long Arabic vowels from the normalized energy in the bands characterizing the first three formants and the pitch.

The rest of this paper is organized as follows. Section 2 describes the methods and tools used and the experiments carried out. Section 3 discusses the results. Section 4 summarizes the conclusions of this work.


2. METHOD 
2.1. General processing 

This section describes the methodology used to characterize the behavior of Arabic vowels through a series of experiments, together with the data collected and the tools used in this study. Ten Moroccan speakers were asked to pronounce isolated CV syllables (C is a consonant and V is one of the vowels /a/, /i/, or /u/) with short and long vowels. We built the corpus using the consonant /s/, given its minimal effect on the vocal tract (Table 1).

The recordings were digitized at a sampling frequency of 22,050 Hz. Each vowel was isolated in Praat to extract the useful signal. These signals were then divided into 11.6 ms segments, each weighted by a Hamming window, zero-padded, and transformed with a 512-point fast Fourier transform (FFT). The first three formant frequencies were estimated using linear predictive coding (LPC), and the energy in each of the corresponding formant bands was then computed.
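The segmentation step can be sketched as follows. At 22,050 Hz, an 11.6 ms segment contains about 256 samples (22,050 × 0.0116 ≈ 255.8), so each segment is zero-padded to 512 points before the FFT, giving a bin spacing of 22,050 / 512 ≈ 43.07 Hz. This is a minimal NumPy sketch that assumes non-overlapping segments (the hop size is not stated); `frame_spectra` is a hypothetical helper name.

```python
import numpy as np

FS = 22_050        # sampling frequency stated in the text (Hz)
FRAME_MS = 11.6    # segment duration stated in the text (ms)
N_FFT = 512        # FFT size stated in the text

def frame_spectra(x, fs=FS, frame_ms=FRAME_MS, n_fft=N_FFT):
    """Return |X(n, k)| for non-overlapping Hamming-windowed segments of x.

    Assumption: segments do not overlap (the hop size is not given).
    """
    frame_len = int(round(fs * frame_ms / 1000))        # 255.78 -> 256 samples
    n_frames = len(x) // frame_len                      # drop the incomplete tail
    frames = np.asarray(x[: n_frames * frame_len], dtype=float).reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)             # Hamming window per segment
    # rfft zero-pads each 256-sample segment to n_fft = 512 points
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))  # shape (n_frames, n_fft // 2 + 1)

# One second of a 1,000 Hz tone: 86 segments of 257 bins each
t = np.arange(FS) / FS
spec = frame_spectra(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (86, 257)
```

In this example the tone peaks at bin 23 (23 × 43.07 ≈ 990.5 Hz), the bin closest to 1,000 Hz.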


2.2. Formant extraction method
LPC models each speech sample as a linear combination of previous samples, and the formants are then determined from the resulting model. The LPC method performs a spectral analysis of the input signal frame by frame, based on the premise that the speech signal is produced by a buzzer placed at the end of a tube. Each output sample is predicted from a linear combination of previous samples weighted by the filter coefficients. The buzz is produced at the glottis, the gap between the vocal cords, and is characterized by its intensity and frequency (pitch). The vocal tract forms the tube, which is characterized by the peaks of its spectral envelope; these peaks are the formants. Figure 1 shows the processing steps applied to the speech signal to extract the formants.


Table 1. Arabic corpus of long and short vowels

Vowel /a/    Vowel /i/    Vowel /u/
/a/          /i/          /u/
/aa/         /ii/         /uu/
/aaa/        /iii/        /uuu/
/aaaa/       /iiii/       /uuuu/
/aaaaa/      /iiiii/      /uuuuu/


Framing and windowing → Computing the LP coefficients → Computing the LP spectrum → Spectral peaks → Detection of formants


Figure 1. Chart of the detection procedure of formants with LPC [13] 
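The chain in Figure 1 can be sketched with NumPy and SciPy as below. This is a minimal sketch, not the exact implementation used in this study: it assumes the autocorrelation method for the LP coefficients and simple peak picking on the LP spectrum, and the default LPC order (the common 2 + fs/1000 rule of thumb) and the helper names are assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import find_peaks, freqz

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC; returns A(z) = [1, -a_1, ..., -a_p]."""
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))        # framing and windowing
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]   # autocorrelation, lags 0..order
    a = solve_toeplitz(r[:order], r[1 : order + 1])                    # Yule-Walker equations
    return np.concatenate(([1.0], -a))                                 # LP coefficients

def lpc_formants(frame, fs=22_050, order=None, n_points=512, n_formants=3):
    """Estimate the first formants as the lowest peaks of the LP spectrum 1/|A|."""
    if order is None:
        order = 2 + fs // 1000            # rule-of-thumb order (assumption): 24 at 22,050 Hz
    A = lpc_coefficients(frame, order)
    freqs, h = freqz([1.0], A, worN=n_points, fs=fs)   # LP spectrum on [0, fs/2)
    lp_db = 20 * np.log10(np.abs(h) + 1e-12)
    peaks, _ = find_peaks(lp_db)                       # spectral peaks
    return freqs[peaks][:n_formants]                   # detection of formants (Hz)
```

The order is exposed as a parameter because an order that is too high for a short segment can add spurious LP-spectrum peaks below the true formants. The frame must not be silent, otherwise the autocorrelation matrix is singular.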


After examining the trends of the first three formants (F1, F2, and F3) for the Arabic vowels /a/, /i/, and /u/, we found that these three frequency bands show significant acoustic changes specific to each vowel. The pitch frequency (F0) must also be considered, since vowels are voiced sounds. The frequency bands considered in this work are:
- /a/: Band F0: 0-400 Hz; Band F1: 500-800 Hz; Band F2: 1,000-1,500 Hz; Band F3: 2,200-2,800 Hz
- /i/: Band F0: 0-400 Hz; Band F1: 100-400 Hz; Band F2: 2,000-3,000 Hz; Band F3: 2,800-3,400 Hz
- /u/: Band F0: 0-400 Hz; Band F1: 300-600 Hz; Band F2: 600-1,100 Hz; Band F3: 2,400-3,000 Hz
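To apply these bands to the 512-point spectra described in Section 2.1, each band can be mapped to a range of FFT bin indices k. The sketch below uses a bin spacing of 22,050 / 512 ≈ 43.07 Hz. Keeping only the bins whose center frequency falls inside the band is an assumed rounding rule, and `BANDS` and `band_to_bins` are hypothetical names.

```python
import math

# Band limits (Hz) per vowel, as listed in the text
BANDS = {
    "a": {"F0": (0, 400), "F1": (500, 800), "F2": (1000, 1500), "F3": (2200, 2800)},
    "i": {"F0": (0, 400), "F1": (100, 400), "F2": (2000, 3000), "F3": (2800, 3400)},
    "u": {"F0": (0, 400), "F1": (300, 600), "F2": (600, 1100), "F3": (2400, 3000)},
}

def band_to_bins(lo_hz, hi_hz, fs=22_050, n_fft=512):
    """Inclusive bin range (k_min, k_max) whose center frequencies lie in [lo_hz, hi_hz]."""
    df = fs / n_fft                      # about 43.07 Hz per bin
    return math.ceil(lo_hz / df), math.floor(hi_hz / df)

print(band_to_bins(*BANDS["a"]["F1"]))  # (12, 18)
```

For the F1 band of /a/, bins 12 to 18 cover 516.8 Hz to 775.2 Hz. The overlapping F0 and F1 bands of /i/ and /u/ simply share bins.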


2.3. Normalized energy computation in formants 

The spectral amplitude was smoothed in each segment by averaging 20 points along the time index n. The calculations focused on the three bands that represent each vowel's three formants, F1, F2, and F3. The frequency index k ranges from the lower to the upper limit of each band. The energy in each band was computed using the following formula:


$$E_b(n) = \sum_{k=k_{b,\min}}^{k_{b,\max}} 10\log_{10}\left(|X(n,k)|^2\right) \qquad (1)$$


where $|X(n,k)|$ is the spectral amplitude at frequency index k in segment n and $E_b(n)$ is the energy of band b in segment n. The normalized band energy of each segment was then determined as:

$$E_{bn}(n) = \frac{E_b(n)}{E_T(n)} \qquad (2)$$

where $E_{bn}(n)$ is the normalized energy of band b in segment n and $E_T(n)$ is the total energy of the segment.
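Equations (1) and (2), together with the 20-point smoothing, can be sketched as below. This is a minimal sketch under stated assumptions: the smoothing is taken as a 20-segment moving average along the time index n, $E_T(n)$ is taken as Eq. (1) evaluated over all frequency bins, and the function names are hypothetical. Because Eq. (1) sums decibel values, its terms are positive only when $|X(n,k)| > 1$, so the amplitude scaling of the spectrum affects the ratio; the sketch implements the equations as written and does not address this.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def smooth_over_time(mag, n_avg=20):
    """Moving average of |X(n, k)| over n_avg segments along the time index n (axis 0)."""
    return uniform_filter1d(np.asarray(mag, dtype=float), size=n_avg, axis=0, mode="nearest")

def band_energy(mag, k_min, k_max):
    """Eq. (1): E_b(n) = sum over k_min..k_max of 10*log10(|X(n, k)|^2), per segment n."""
    power = np.maximum(mag[:, k_min : k_max + 1] ** 2, 1e-12)   # floor avoids log10(0)
    return np.sum(10 * np.log10(power), axis=1)

def normalized_band_energy(mag, k_min, k_max):
    """Eq. (2): E_bn(n) = E_b(n) / E_T(n).

    Assumption: E_T(n) is Eq. (1) evaluated over every frequency bin of segment n.
    """
    total = band_energy(mag, 0, mag.shape[1] - 1)
    return band_energy(mag, k_min, k_max) / total
```

Multiplying the result by 100 expresses it as a percentage, the unit used in Table 2.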


3. RESULTS AND DISCUSSION 
3.1. Energy distribution in formants 

The objective of this section is to analyze the normalized energy distribution of the Arabic vowels (three short: /a/, /i/, /u/; three long: /a:/, /i:/, /u:/) in the frequency bands of the first three formants (F1, F2, and F3) and the pitch F0. The first three formants are the most significant for characterizing Arabic vowels. The results are summarized in Table 2. They show that at least 68% of the energy is concentrated in the F0 (pitch), F1, F2, and F3 (formant) bands for each of the vowels /a/, /i/, and /u/.

All three vowels show substantial energy in the F0 band, which is explained by the vibration of the vocal cords during vowel production. For the vowel /a/, the F0 and F1 bands are separate because of the position of the tongue (Figure 2): the closer the tongue is to the roof of the mouth, the lower the frequency of the first formant, and since the tongue is low for /a/, its F1 band (500-800 Hz) lies above the F0 band (0-400 Hz).


Table 2. Average normalized energies (%) in bands F0, F1, F2 and F3

Vowel   Em_F0    Em_F1    Em_F2    Em_F3    Total
/a/     21.42    26.20    18.73     1.84    68.19
/a:/    40.44    17.91    16.39     2.12    76.85
/i/        51.75*          9.55     6.91    68.22
/i:/       60.10*          9.90     7.52    77.53
/u/     38.93    27.87    11.19     0.48    78.46
/u:/    59.34    15.07    15.42     0.14    89.97

* For /i/, the F0 and F1 bands overlap, so a single value is given for both.


Figure 2. Vowel articulatory movements (/a/, /i/ and /u/) from [10] 


The high energy density in the F0 band for the vowel /i/ is explained by the overlap of its pitch and first-formant bands (0-400 Hz and 100-400 Hz). Similarly, for the vowel /u/ the pitch and first-formant bands overlap (0-400 Hz and 300-600 Hz), which explains the significant energy in both the F0 and F1 bands. For the long vowels, the energy in the pitch band increases, which we attribute to the greater energy needed to sustain vocal-cord vibration when producing a long vowel.

Examining the energy distribution in the formant bands, we observed that the vowel /i/ has the highest percentage of energy in the F1 band compared to the vowels /a/ and /u/: in the band corresponding to the first formant, both /i/ and /i:/ have an energy rate greater than 50%. The vowel /i:/ can easily be distinguished from /i/ because its energy rate in this band is higher (60.10% versus 51.75%). Moreover, the energy distribution in the F2 band shows that, unlike for /a/ and /i/, this band is important for distinguishing short from long /u/. The energy in the F3 band, in contrast, is not significant for characterizing Arabic vowels.


3.2. Algorithm 

The results of the previous section show that Arabic vowels can be identified from the percentage distribution of energy in the frequency bands characterizing the first three formants and the pitch. Building on these results, we developed an algorithm that distinguishes short and long Arabic vowels. The speech input undergoes several processing stages. First, the spectrogram is computed. Then, an energy contour is constructed in the first band, its derivative is calculated, and peaks in the derivative are detected. These peaks feed the g-landmark processing step, which locates the vowel within each CV syllable (Figure 3). Finally, the frequency bands characterizing the first three formants and the pitch are determined, and the normalized energy in each of these bands is computed.
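The landmark step can be sketched as below. An energy contour is computed in a low-frequency band and differentiated. The largest rise is taken as the g+ landmark (voicing onset), and the largest fall after it is taken as the g- landmark (voicing offset). This is a simplified sketch, not the exact procedure of the algorithm. It assumes one vowel per CV syllable and a first band of 0-400 Hz (the F0 band used in this study). The single-peak selection rule, the band edge, and the function names are assumptions.

```python
import numpy as np

def first_band_energy_db(mag, fs=22_050, n_fft=512, hi_hz=400.0):
    """Energy contour (dB) per segment in the first band, assumed to be 0-400 Hz."""
    freqs = np.arange(mag.shape[1]) * fs / n_fft
    band = mag[:, freqs <= hi_hz]
    return 10 * np.log10(np.sum(band ** 2, axis=1) + 1e-12)

def g_landmarks(energy_db):
    """Return (g_plus, g_minus) segment indices from the energy derivative.

    g_plus: first segment after the largest energy rise (voicing onset).
    g_minus: first segment after the largest fall following g_plus (voicing offset).
    """
    d = np.diff(np.asarray(energy_db, dtype=float))   # d[n] = E[n + 1] - E[n]
    g_plus = int(np.argmax(d)) + 1
    rest = d[g_plus:]
    if rest.size == 0:                                # no fall found: vowel runs to the end
        return g_plus, len(energy_db)
    g_minus = g_plus + int(np.argmin(rest)) + 1
    return g_plus, g_minus

# Low energy for 10 segments (/s/), high for 20 (vowel), then low again
contour = np.array([0.0] * 10 + [60.0] * 20 + [0.0] * 10)
print(g_landmarks(contour))  # (10, 30)
```

The vowel then spans the segments from g+ up to, but not including, g-. These are segments 10 to 29 in this example, and only they are passed to the band-energy computation.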
Figure 3. Vowel (v) localized between the g+ and g- landmarks (white represents the energy in the first band)


The output of this processing is passed to the classification stage, which identifies the vowel. The classification stage is implemented in the algorithm shown in Figure 4, which distinguishes short and long vowels. The algorithm was evaluated on our corpus, and the results are summarized in Tables 3 to 5. For the vowels /a/ and /u/, the accuracy of the algorithm is above 90% (96.55% for /a/, 91.67% for /a:/, 94.29% for /u/, and 94.44% for /u:/). For the vowel /i/, the algorithm is less effective, with an accuracy of about 80% (80.56% for /i/ and 80% for /i:/).


Speech input → Energy and its derivative → g landmarks → Computation of formants for classification of Arabic vowels → Arabic vowels classification → /a/ short and long; /u/ short and long; /i/ short and long

Figure 4. Algorithm of classification of Arabic vowels 


Table 3. Confusion matrix of vowels /a/ and /a:/

         /a/    /a:/    Overall percentage of errors (%)
/a/       28      1      3.45
/a:/       4     44      8.33

Table 4. Confusion matrix of vowels /u/ and /u:/

         /u/    /u:/    Overall percentage of errors (%)
/u/       33      2      5.71
/u:/       2     34      5.56

Table 5. Confusion matrix of vowels /i/ and /i:/

         /i/    /i:/    Overall percentage of errors (%)
/i/       29      7     19.44
/i:/       8     32     20.00
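The per-class accuracies quoted above and the summary rates given in the abstract and conclusion follow directly from Tables 3 to 5. The sketch below recomputes them. It reads rows as the pronounced vowel and columns as the recognized vowel, an interpretation consistent with the error percentages. The names are illustrative.

```python
import numpy as np

# Confusion matrices from Tables 3-5 (rows: pronounced vowel, columns: recognized vowel)
CONFUSION = {
    "a": np.array([[28, 1], [4, 44]]),
    "u": np.array([[33, 2], [2, 34]]),
    "i": np.array([[29, 7], [8, 32]]),
}

def per_class_accuracy(cm):
    """Percentage of correctly recognized tokens per row (short, long)."""
    return 100 * np.diag(cm) / cm.sum(axis=1)

def pooled_accuracy(matrices, row):
    """Accuracy pooled over several matrices for one row (0 = short, 1 = long)."""
    correct = sum(int(cm[row, row]) for cm in matrices)
    total = sum(int(cm[row].sum()) for cm in matrices)
    return 100 * correct / total

mats = list(CONFUSION.values())
overall = 100 * sum(int(np.trace(cm)) for cm in mats) / sum(int(cm.sum()) for cm in mats)
print(round(pooled_accuracy(mats, 0), 2), round(pooled_accuracy(mats, 1), 2), round(overall, 2))
# 90.0 88.71 89.29
```

Pooling the three tables gives 90/100 = 90% for short vowels, 110/124 ≈ 88.71% for long vowels, and 200/224 ≈ 89.29% overall.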


The energy distribution of Arabic vowels shows that their behavior depends on their place of articulation. This research found that the energy in the F0 (pitch), F1, F2, and F3 bands characterizes Arabic vowels and makes it possible to differentiate between short and long vowels. These results are competitive with those reported in the literature [23], [25]-[27].


4. CONCLUSION 

This work presented a new approach for the characterization and classification of Arabic vowels based on acoustic cues. The study relied on the energy percentage in the formant and pitch frequency bands (/a/: Band F0: 0-400 Hz; Band F1: 500-800 Hz; Band F2: 1,000-1,500 Hz; Band F3: 2,200-2,800 Hz; /i/: Band F0: 0-400 Hz; Band F1: 100-400 Hz; Band F2: 2,000-3,000 Hz; Band F3: 2,800-3,400 Hz; /u/: Band F0: 0-400 Hz; Band F1: 300-600 Hz; Band F2: 600-1,100 Hz; Band F3: 2,400-3,000 Hz). The results show that the energy in the bands related to the formants F1, F2, and F3 and to the pitch makes it possible first to identify the Arabic vowel and then to distinguish between its long and short forms. Classification experiments were performed on Arabic vowels extracted from our Arabic corpus and yielded an overall classification rate of 89.29%.